The Specification of POS Tagging of the Hong Kong University Cantonese Corpus
Abstract
The Hong Kong University Cantonese Corpus (HKUCC) was collected from transcribed spontaneous speech in conversations and radio programs involving two to four people. It was word-segmented, annotated with Cantonese pronunciation, and recently tagged with word classes using the part-of-speech (POS) scheme of Yu et al. (2002). This scheme, which was designed for tagging written Mandarin texts, encountered some problems in tagging spoken Cantonese. However, its 26 basic word classes can be flexibly extended with customized subclasses for annotating other Chinese dialects (e.g., Cantonese). Its robustness was demonstrated by the annotation of approximately 230,000 words in the HKUCC. This article describes the format of the corpus and provides a specification that helps annotators with POS tagging and resolves problems encountered in manual annotation. Guidelines for tagging some word classes are introduced, followed by a discussion of easily confused tags, illustrated with examples from the corpus. Further work will aim at automatic annotation in order to facilitate POS tagging of Cantonese and other Chinese dialects. Corpora of Hong Kong Cantonese are scarce. Past work focused either on a POS-tagged corpus of child language or on the phonetic transcription of an adult Cantonese corpus. The HKUCC fills this gap by providing a POS-tagged corpus of adult Cantonese and is believed to be of great value for data-driven linguistic analysis and natural language processing of Cantonese.
Journal: IJTHI
Volume 2
Publication date: 2006